Add Parquet variant shredding support #332
Conversation
Pull request overview
Adds end-to-end Parquet shredded-variant support (reader + producer) under Apache.Arrow.Operations.Shredding, with supporting enhancements to Arrow Variant scalar/array APIs and conformance fixtures converted to Arrow IPC for CI.
Changes:
- Introduces `Apache.Arrow.Operations.Shredding` types (e.g., `ShredType`, `ShredOptions`, and shared helpers) to represent and operate on shredded `typed_value` layouts.
- Extends Variant scalar tooling with cross-metadata transcoding support (`VariantValueWriter.CopyValue`) and a metadata prepass helper (`VariantMetadataBuilder.CollectFieldNames`).
- Adds a regeneration script and checks in Arrow IPC fixtures converted from the Parquet shredded-variant corpus.
Reviewed changes
Copilot reviewed 29 out of 166 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| test/shredded_variant_ipc/regen.py | Script to regenerate Arrow IPC fixtures from the parquet-testing shredded-variant corpus. |
| test/shredded_variant_ipc/case-*.arrow (many files) | Checked-in Arrow IPC fixtures generated from the shredded-variant Parquet test corpus. |
| src/Apache.Arrow.Scalars/Variant/VariantValueWriter.cs | Adds CopyValue(VariantReader) to transcode values while re-resolving field IDs against a target metadata dictionary. |
| src/Apache.Arrow.Scalars/Variant/VariantValue.cs | Adds FromDecimal16(SqlDecimal) to preserve Decimal16 intent and support values beyond decimal range. |
| src/Apache.Arrow.Scalars/Variant/VariantMetadataBuilder.cs | Adds CollectFieldNames(VariantReader) for two-pass encode workflows. |
| src/Apache.Arrow.Operations/Shredding/ShreddingHelpers.cs | Adds shared helper to construct per-row ShreddedVariant slots from element-group structs. |
| src/Apache.Arrow.Operations/Shredding/ShredType.cs | Defines the shredding type system for typed_value columns (primitive + object/array). |
| src/Apache.Arrow.Operations/Shredding/ShredOptions.cs | Adds schema inference tuning options (depth, frequency, type consistency). |
| src/Apache.Arrow.Operations/Apache.Arrow.Operations.csproj | Adds a project reference to Apache.Arrow to support shredding operations over Arrow arrays/types. |
```csharp
        SqlDecimal normalized = SqlDecimal.ConvertToPrecScale(value, 38, value.Scale);
        return new VariantValue(VariantPrimitiveType.Decimal16, (object)normalized);
    }
    return new VariantValue(VariantPrimitiveType.Decimal16, (object)value.Value);
```
FromDecimal16(SqlDecimal) converts to decimal via value.Value when value.Data[3] == 0. SqlDecimal.Value can still throw for values that aren't representable as System.Decimal (e.g., scale/precision beyond decimal’s limits) even when the magnitude fits in 96 bits. Consider storing the SqlDecimal instance in those cases (or using a try/catch fallback) so Decimal16 materialization can’t unexpectedly overflow.
Suggested change:
```diff
-    return new VariantValue(VariantPrimitiveType.Decimal16, (object)value.Value);
+    try
+    {
+        return new VariantValue(VariantPrimitiveType.Decimal16, (object)value.Value);
+    }
+    catch (OverflowException)
+    {
+        SqlDecimal normalized = SqlDecimal.ConvertToPrecScale(value, 38, value.Scale);
+        return new VariantValue(VariantPrimitiveType.Decimal16, (object)normalized);
+    }
```
I think in a followup change I'm going to always store a Decimal16 as a SqlDecimal and vice versa. The current "convert to decimal if it fits" strategy is unnecessarily complicated. Filed #33 to cover this.
```csharp
{
    StructType elementGroupType = (StructType)elementGroup.Data.DataType;
    int valueIdx = elementGroupType.GetFieldIndex("value");
    int typedIdx = elementGroupType.GetFieldIndex("typed_value");
```
These should probably be cached; need to take a second look.
(Shouldn't affect the public API, so can be done as a followup.)
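If that follow-up happens, one possible shape is to resolve the indexes once per array rather than once per element. This is a hedged sketch only: the `ElementGroupIndexes` type and where it would be stored are hypothetical, not part of this PR.

```csharp
using Apache.Arrow.Types;

// Hypothetical per-array cache: resolve the struct field indexes once,
// then reuse them for every element instead of calling GetFieldIndex per row.
internal readonly struct ElementGroupIndexes
{
    public readonly int ValueIdx;
    public readonly int TypedIdx;

    public ElementGroupIndexes(StructType elementGroupType)
    {
        ValueIdx = elementGroupType.GetFieldIndex("value");
        TypedIdx = elementGroupType.GetFieldIndex("typed_value");
    }
}
```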
adamreeve left a comment:
I started reviewing this but didn't get very far; I'll just leave the couple of comments I have for now.
adamreeve left a comment:
This all looks good to me, thanks Curt; only a few minor comments.
```csharp
BinaryArray metadataArr = metadataBuilder.Build(allocator);

// value column: residual bytes (or null).
BinaryArray valueArr = BuildBinaryColumn(rows, allocator);
```
Should we omit the value array if values are fully shredded? Probably fine to add that as an optimisation later though if there's a need for it.
My concern with doing that is a hypothetical scenario where we're shredding a column in a very large table. We get the values as an `IArrowArrayStream` instead of an `IArrowArray` and we run each of the batches through `ShredSchemaInferrer`, leaving us with a `ShredSchema`. Now we take a second pass through the `IArrowArrayStream` and shred the batches, one at a time. Each of the batches will need to conform to the shredded schema, and we can't just omit values in one of them without knowing whether or not it can be omitted in all of them.
In short, I think this would require a separate knob based on the bigger picture.
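For concreteness, a rough sketch of that two-pass flow. Only `ShredSchemaInferrer`, `ShredSchema`, `ShredOptions`, and `VariantShredder` are names from this PR; the `Observe`/`Finish`/`Shred` method shapes are placeholders, not the actual API.

```csharp
using System.Threading.Tasks;
using Apache.Arrow;
using Apache.Arrow.Ipc;
using Apache.Arrow.Operations.Shredding;

static async Task ShredStreamAsync(
    IArrowArrayStream pass1, IArrowArrayStream pass2, ShredOptions options)
{
    // Pass 1: observe every batch so the inferred schema is valid for
    // the whole stream, not just the first batch.
    var inferrer = new ShredSchemaInferrer(options); // ctor shape assumed
    while (await pass1.ReadNextRecordBatchAsync() is RecordBatch batch)
    {
        inferrer.Observe(batch); // placeholder name
    }
    ShredSchema schema = inferrer.Finish(); // placeholder name

    // Pass 2: every batch must conform to the same shredded schema,
    // which is why `value` can't be dropped from just one of them.
    while (await pass2.ReadNextRecordBatchAsync() is RecordBatch batch)
    {
        RecordBatch shredded = VariantShredder.Shred(batch, schema); // placeholder
        // ... hand `shredded` to the writer
    }
}
```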
What's Changed
Implements the Parquet variant shredding spec end-to-end in a new `Apache.Arrow.Operations.Shredding` namespace, alongside minor changes to the base scalar and array types.

Operations.Shredding reader side:
- `ShreddedVariant`/`ShreddedObject`/`ShreddedArray` ref-struct trio exposing typed columns and residual bytes side-by-side.
- `VariantArrayShreddingExtensions` adds `GetShreddedVariant(i)` and `GetLogicalVariantValue(i)` on `VariantArray` (see the sketch below).
- `ShredSchema.FromArrowType` derives a shredding schema from an Arrow `typed_value` type, rejecting unsupported types (uint32, fixed-size-binary(N≠16)).
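A hedged usage sketch of the reader side. The extension-method names come from this PR; return types, namespaces, and the surrounding setup are assumptions.

```csharp
using Apache.Arrow;
using Apache.Arrow.Operations.Shredding;

static void ReadShredded(VariantArray variants)
{
    for (int i = 0; i < variants.Length; i++)
    {
        // Typed columns plus residual bytes, side by side.
        ShreddedVariant shredded = variants.GetShreddedVariant(i);

        // Or reassemble the logical (unshredded) value for the row.
        VariantValue logical = variants.GetLogicalVariantValue(i);
    }
}
```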
Operations.Shredding producer side:
- `VariantShredder` decomposes a column of `VariantValue`s against a `ShredSchema` into shared metadata + per-row `ShredResult`s.
- `ShreddedVariantArrayBuilder` assembles those into a shredded `VariantArray` with a `typed_value` Arrow tree matching the schema (sketch below).
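And the producer side, sketched end to end. Only the type names and `ShredSchema.FromArrowType` appear in this PR; constructor shapes, method names, and namespaces are assumptions.

```csharp
using System.Collections.Generic;
using Apache.Arrow;
using Apache.Arrow.Memory;
using Apache.Arrow.Operations.Shredding;
using Apache.Arrow.Types;

static VariantArray ShredColumn(
    IArrowType typedValueType, IEnumerable<VariantValue> values, MemoryAllocator allocator)
{
    // Derive the shredding schema from the target typed_value Arrow type.
    ShredSchema schema = ShredSchema.FromArrowType(typedValueType);

    // Decompose each value into shared metadata plus a per-row result,
    // then assemble the results into a shredded VariantArray.
    var shredder = new VariantShredder(schema);            // ctor shape assumed
    var builder = new ShreddedVariantArrayBuilder(schema); // ctor shape assumed
    foreach (VariantValue value in values)
    {
        builder.Append(shredder.Shred(value)); // method names assumed
    }
    return builder.Build(allocator);
}
```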
Apache.Arrow changes:
- `VariantExtensionDefinition` accepts `struct<metadata, value?, typed_value?>` layouts in addition to the plain unshredded form.
- `VariantType` gains `IsShredded`/`HasValueColumn`/`HasTypedValueColumn`/`TypedValueField` properties.
- `VariantArray.GetVariantValue` and `GetVariantReader` throw on shredded columns with a pointer to the `Operations.Shredding` extensions (see the guard sketch below).
- The public `VariantArray(IArrowArray)` constructor now infers the `VariantType` (shredded or not) from the storage shape.
- Operations gains a project reference to Apache.Arrow; Apache.Arrow does not reference Operations.
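A short hedged sketch of how the new properties compose with the throwing accessors. The property names are from this PR; everything else here is assumed.

```csharp
using Apache.Arrow;
using Apache.Arrow.Operations.Shredding;

static VariantValue GetRow(VariantArray variants, int i)
{
    var variantType = (VariantType)variants.Data.DataType;

    // GetVariantValue throws on shredded columns, so route those rows
    // through the Operations.Shredding extension instead.
    return variantType.IsShredded
        ? variants.GetLogicalVariantValue(i)
        : variants.GetVariantValue(i);
}
```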
Apache.Arrow.Scalars changes:
- `VariantValueWriter.CopyValue(VariantReader source)` transcodes a reader into this writer, re-resolving field IDs against the writer's metadata dictionary. Supports cross-dictionary transcoding and multi-source merge-into-one-dictionary workflows.
- `VariantMetadataBuilder.CollectFieldNames(VariantReader source)` is the two-pass companion that accumulates source field names into the target metadata builder (two-pass flow sketched below).
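To make the two-pass shape concrete, a hedged sketch of a single cross-dictionary transcode. `CreateReader` is a hypothetical stand-in, and the writer constructor shape is an assumption; only `CollectFieldNames` and `CopyValue` come from this PR.

```csharp
static void Transcode(byte[] sourceBytes)
{
    // Pass 1: the metadata prepass. Register every field name the source
    // uses in the target metadata builder.
    var metadataBuilder = new VariantMetadataBuilder();
    metadataBuilder.CollectFieldNames(CreateReader(sourceBytes)); // CreateReader is hypothetical

    // Pass 2: transcode. CopyValue re-resolves each field ID against the
    // dictionary built in pass 1, so a value encoded under a different
    // metadata dictionary lands with correct IDs.
    var writer = new VariantValueWriter(metadataBuilder); // ctor shape assumed
    writer.CopyValue(CreateReader(sourceBytes));
}
```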
Validation:
- Conformance fixtures come from `apache/parquet-testing` (`test/parquet-testing/shredded_variant/`).
- `test/shredded_variant_ipc/regen.py` converts each `case-NNN.parquet` to an Arrow IPC file via `pyarrow`; 137 resulting `.arrow` files are checked in so CI needs no Python.
- All 128 valid conformance cases pass; 6 schema-invalid and data-invalid cases are rejected with clear errors; 3 "spec-invalid but permissive" INVALID cases are documented as read-without-throw.